Mh/full pipeline diffusion adjusted scores by moritzhauschulz · Pull Request #2491 · ecmwf/WeatherGenerator

moritzhauschulz · 2026-06-11T15:11:15Z

Description

This PR does two things:

- It fixes the SSR (spread-skill ratio) computation. Previously, the member RMSE was used as the denominator. I (and Claude) think this should be the ensemble RMSE.
- It fixes some dead ends ('return None') in score.py.
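
For reference, a minimal sketch of the difference between the two denominators. It reads "ensemble RMSE" as the RMSE of the ensemble mean, which is the usual definition but is an assumption here, since the PR text does not spell it out. The function names and toy data are hypothetical and are not taken from score.py.

```python
import math
from statistics import mean, variance


def ensemble_mean_rmse(members, obs):
    """RMSE of the ensemble mean against the observations."""
    n = len(obs)
    ens_mean = [mean(m[i] for m in members) for i in range(n)]
    return math.sqrt(mean((f - o) ** 2 for f, o in zip(ens_mean, obs)))


def member_rmse(members, obs):
    """RMSE pooled over every individual member (the old denominator)."""
    return math.sqrt(mean((f - o) ** 2 for m in members for f, o in zip(m, obs)))


def spread(members):
    """Square root of the mean (unbiased) ensemble variance over all points."""
    n = len(members[0])
    return math.sqrt(mean(variance([m[i] for m in members]) for i in range(n)))


def ssr(members, obs):
    """Spread-skill ratio with the ensemble-mean RMSE as denominator."""
    return spread(members) / ensemble_mean_rmse(members, obs)


# Toy data (hypothetical): 2 members, 3 verification points.
members = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
obs = [2.0, 2.0, 5.0]

print(ssr(members, obs))                             # about 1.73
print(spread(members) / member_rmse(members, obs))   # about 1.10
```

The pooled member RMSE is never smaller than the ensemble-mean RMSE, because the mean squared member error equals the squared error of the ensemble mean plus the ensemble variance. Dividing by the member RMSE therefore biases the ratio low.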

This should probably be reviewed by someone from the eval team (maybe @iluise ?).

With this fix, and the fix in the eval config, I can now produce CRPS, SSR and spread plots and maps (pretty much out of the box).

Issue Number

Is this PR a draft? Mark it as draft.

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
[] I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig · 2026-06-12T11:24:24Z

@moritzhauschulz : can we also open this against develop please.

moritzhauschulz · 2026-06-14T09:27:20Z

@clessig PR on develop is here

clessig · 2026-06-15T05:59:55Z

Thanks!

MatKbauer

Thanks, Moritz! I'm currently testing this. When I try to launch a training on this branch with

uv run --offline train --base-config config/config_diffusion_d2048_forecast.yml

I'm getting an error that seems related to empty predictions:

Traceback (most recent call last):
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/run_train.py", line 193, in run_train
    trainer.run(cf, devices)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/train/trainer.py", line 429, in run
    self.train(mini_epoch)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/train/trainer.py", line 549, in train
    self.grad_scaler.step(self.optimizer)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 454, in step
    len(optimizer_state["found_inf_per_device"]) > 0
AssertionError: No inf checks were recorded for this optimizer.
[3] > /e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/amp/grad_scaler.py(454)step()
-> len(optimizer_state["found_inf_per_device"]) > 0

This does not happen to me on jk/develop/diffusion-full-pipeline, but in the changed files of this PR, I can't quite see where the problem comes from. @moritzhauschulz, do you have an idea?

moritzhauschulz · 2026-06-16T13:31:29Z

Thanks for taking a look, @MatKbauer. This is indeed strange, and I am getting the same error. I have seen it before, but I don't recall when. At first glance it seems to be unrelated to my changes (which is why I didn't test the training, whoops). However, I currently get the same error on jk/develop/diffusion-full-pipeline, so maybe it is caused by a package update that's not compatible?

github-project-automation Bot added this to WeatherGen-dev Jun 11, 2026

github-actions Bot added the labels eval (anything related to the model evaluation pipeline) and model (related to model training or definition, not generic infra) Jun 11, 2026

moritzhauschulz marked this pull request as ready for review June 14, 2026 10:15

MatKbauer reviewed Jun 16, 2026

moritzhauschulz added 6 commits June 18, 2026 10:32

reactivated assert

61fc25d

score fixes

15b06ba

more metrics

26a22e9

remove comments

451e9b2

remove docs

6896462

implemented spread adj and ssr adj

a26fded

moritzhauschulz force-pushed the mh/full-pipeline-diffusion-adjusted-scores branch from ad88e05 to a26fded Compare June 18, 2026 08:46

ssr_adj in config eval

9ebac94

Jubeku merged commit 246f05e into ecmwf:jk/develop/diffusion-full-pipeline Jun 18, 2026
1 check passed

github-project-automation Bot moved this to Done in WeatherGen-dev Jun 18, 2026

Jubeku merged 7 commits into ecmwf:jk/develop/diffusion-full-pipeline from moritzhauschulz:mh/full-pipeline-diffusion-adjusted-scores
